RNA Aligner

An example of how a RNA nucleotide sequence might align with a genome

Sequence alignment is an important problem in bioinformatics: simply put, given a genetic sequence, in this case RNA, where is its best match in a larger sequence, such as an entire genome? Finding an exact match can be done in linear time , and even inexact matches can be found in linear time if you bound the number of mismatches. However, the task is not always this simple: nucleotide substitution may have taken place anywhere in either sequence, and long stretches of DNA (known as introns) may have been inserted in the middle of the genomic sequence. Perhaps the biggest issue is that the genome is very large - a human genome is by itself ~3 GB, so even a polynomial space/time algorithm on the size of the genome may be unrealistic.

The approach this algorithm takes is similar to that of Bowtie2 with some simplifications and modifications for the problem at hand. It is able to find an alignment in under 0.5 seconds, and using FM-index compression, space requirements are low - the genome sequence is only needed during initialization and can be discarded when finding an alignment.

The (simplified) steps of the algorithm are as follows:

Break the read sequence into overlapping seeds of length 16 every 10 positions.
For a given isoform, align each seed greedily to minimize mismatches. A naive algorithm would take quadratic time, but it can be done in linear time using the FM-index.
Discard all seeds with more than a threshold number of mismatches.
Choose a subset of seeds using a weighted random selection based on the number of mismatches associated with its best alignment.
Expand each seed chosen to the full length of the read sequence, to get a set of alignments for the read sequence with the isoform. Make sure to remove any introns from the alignment.
Count the number of mismatches in each alignment to get the “best” alignment for that isoform.
Repeat for each isoform to find the best alignment for the read sequence across the entire genome.

I mostly wrote this to explore the mechanics behind the operation of various sequence alignment algorithms, and to see how the implementation might vary if you’re working under a different set of assumptions. It is not necessarily intended to be used for real-world applications as is, but with some optimizations it could form the basis for a usable tool.